Using English as Pivot to Extract Persian-Italian Parallel Sentences from Non-Parallel Corpora

نویسندگان

  • Ebrahim Ansari
  • Mohammad Hadi Sadreddini
  • Mostafa Sheikhalishahi
  • Richard Wallace
  • Fatemeh Alimardani
چکیده

Ebrahim Ansari ([email protected]) et al. 2017. Using english as pivot to extract persian-italian parallel sentences from non-parallel corpora. In " Applications of Comparable Corpora " edited book Berlin Linguistic Press (ed.). The effectiveness of a statistical machine translation system (SMT) is very dependent upon the amount of parallel corpus used in the training phase. For low-resource language pairs there are not enough parallel corpora to build an accurate SMT. In this paper, a novel approach is presented to extract bilingual Persian-Italian parallel sentences from a non-parallel (comparable) corpus. In this study, English is used as the pivot language to compute the matching scores between source and target sentences and candidate selection phase. Additionally, a new monolingual sentence similarity metric, Normalized Google Distance (NGD) is proposed to improve the matching process. Moreover, some extensions of the baseline system are applied to improve the quality of extracted sentences measured with BLEU. Experimental results show that using the new pivot based extraction can increase the quality of bilingual corpus significantly and consequently improves the performance of the Persian-Italian SMT system.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Extracting Persian-English Parallel Sentences from Document Level Aligned Comparable Corpus using Bi-Directional Translation

Bilingual parallel corpora are very important in various filed of natural language processing (NLP). The quality of a Statistical Machine Translation (SMT) system strongly dependent upon the amount of training data. For low resource language pairs such as Persian-English, there are not enough parallel sentences to build an accurate SMT system. This paper describes a new approach to use the Wiki...

متن کامل

Creation of comparable corpora for English-Urdu, Arabic, Persian

Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, in the case of under-resourced languages or some specific domains, parallel corpora are not readily available. This leads to under-performing machine translation systems in those sparse data settings. To overcome the low availability of parallel resources the machine translation community has rec...

متن کامل

استخراج پیکره‌ موازی از اسناد قابل‌مقایسه برای بهبود کیفیت ترجمه در سیستم‌های ترجمه ماشینی

Data used for training statistical machine translation method are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation but due to lack of these kinds of texts, non-parallel and comparable corpora are used either for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...

متن کامل

Extracting an English-Persian Parallel Corpus from Comparable Corpora

Parallel data are an important part of a reliable Statistical Machine Translation (SMT) system. The more of these data are available, the better the quality of the SMT system. However, for some language pairs such as Persian-English, parallel sources of this kind are scarce. In this paper, a bidirectional method is proposed to extract parallel sentences from English and Persian document aligned...

متن کامل

Extracting Bilingual Persian Italian Lexicon from Comparable Corpora Using Different Types of Seed Dictionaries

Ebrahim Ansari ([email protected]) et al. 2017. Extracting bilingual per-sian italian lexicon from comparable corpora using different types of seed dictionaries. In " Applications of Comparable Corpora " edited book Berlin Linguistic Press (ed.). Bilingual dictionaries are very important in various fields of natural language processing. In recent years, research on extracting new bilingual lex...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1701.08339  شماره 

صفحات  -

تاریخ انتشار 2017